R Bootcamp 2020
# Make tidyverse load quietly
options(tidyverse.quiet = TRUE)
library(tidyverse) #< General use
library(tidyr) #< Reshaping
library(wbstats) #< World bank data.
library(countrycode) #< Country coding
library(assertthat) #< Results checking
library(lubridate) #< Date manipulations
requireNamespace('zoo')#< time series
for .RData files use load()
load("data/ebola.data.RData")
for .rds files use readRDS() and capture the results in a variable.
ebola <- readRDS("data/ebola.data.rds")
Any problems with the data?
The Most obvious is that Country Report Date should be repeated down the rows.
Fix with dplyr::mutate() + zoo::na.locf() (Missing last observation carried forward.)
mutate() variantsmutate() - modify/add variablesmutate_at() - modify a set of variables.mutate_if() - modify variables meeting a criteriatransmute() - create a new set of variables based on previous.ebola.data %<>%
mutate_at('Country Report Date', zoo::na.locf)
%<>% to modify in place. Generally this is frowned uponselect() - Choosing variablesUse select() to choose the variables desired.
select(data, ...)
Over the next few examples we will explore the forms ... can take
select() - Variable Namesthe easiest is with variable names:
ebola.data %>% select(Country, `Case def.`, `Total cases`) %>% head()
| Country | Case def. | Total cases |
|---|---|---|
| Guinea | Confirmed | 3351 |
| Guinea | Probable | 453 |
| Guinea | Suspected | 0 |
| Guinea | All | 3804 |
| Liberia | Confirmed | 3151 |
| Liberia | Probable | 1879 |
select() - Dropping by Variable NamesYou can select everything but a variable with the minus operator
ebola.data %>% select(-`Total cases`) %>% head()
| SheetName | Country | Case def. | Total deaths | Country Report Date |
|---|---|---|---|---|
| Jan 06, 2016 | Guinea | Confirmed | 2083 | 2015-12-27 |
| Jan 06, 2016 | Guinea | Probable | 453 | 2015-12-27 |
| Jan 06, 2016 | Guinea | Suspected | 0 | 2015-12-27 |
| Jan 06, 2016 | Guinea | All | 2536 | 2015-12-27 |
| Jan 06, 2016 | Liberia | Confirmed | 0 | 2015-05-09 |
| Jan 06, 2016 | Liberia | Probable | 0 | 2015-05-09 |
select() - By the numbersYou can select by variable position as well.
ebola.data %>% select(1:4) %>% head()
| SheetName | Country | Case def. | Total cases |
|---|---|---|---|
| Jan 06, 2016 | Guinea | Confirmed | 3351 |
| Jan 06, 2016 | Guinea | Probable | 453 |
| Jan 06, 2016 | Guinea | Suspected | 0 |
| Jan 06, 2016 | Guinea | All | 3804 |
| Jan 06, 2016 | Liberia | Confirmed | 3151 |
| Jan 06, 2016 | Liberia | Probable | 1879 |
select() - by variable rangeUse single colon : with variable names to select variables named and eveything in between:
ebola.data %>% select(Country:`Total cases`) %>% head()
| Country | Case def. | Total cases |
|---|---|---|
| Guinea | Confirmed | 3351 |
| Guinea | Probable | 453 |
| Guinea | Suspected | 0 |
| Guinea | All | 3804 |
| Liberia | Confirmed | 3151 |
| Liberia | Probable | 1879 |
select() - by helpersselection helpers are also provided:
ebola.data %>% select(starts_with("Total")) %>% head()
| Total cases | Total deaths |
|---|---|
| 3351 | 2083 |
| 453 | 453 |
| 0 | 0 |
| 3804 | 2536 |
| 3151 | 0 |
| 1879 | 0 |
select() - The helpersThe available helpers are:
starts_with()ends_with()contains() - must match literallymatches() - Regular expression matchnum_range() - numerical ranged variables with a prefixall_of() - must match all of given variables named in a vector.any_of() - select any variables present in given vector, but no error if not present.everything() - Matches all variables, useful when reordering variables.last_col() - The last columnselect() - MultipleYou may use multiple forms together
ebola.data %>% select(last_col(), 2:3, `Total cases`) %>% head()
| Country Report Date | Country | Case def. | Total cases |
|---|---|---|---|
| 2015-12-27 | Guinea | Confirmed | 3351 |
| 2015-12-27 | Guinea | Probable | 453 |
| 2015-12-27 | Guinea | Suspected | 0 |
| 2015-12-27 | Guinea | All | 3804 |
| 2015-05-09 | Liberia | Confirmed | 3151 |
| 2015-05-09 | Liberia | Probable | 1879 |
Subset data with the filter() function.
The base R version is subset, but it is FAR less robust.
It takes the form of
filter(data, expr1, expr2, ...)
where data is the data set, and expr1, expr2, … are the criteria expressions evaluated in the context of the data. Data must meet all ctriteria to remain.
filter() ExampleSubset data to only confirmed cases for Nigeria.
filter( ebola.data
, Country == 'Nigeria'
, `Case def.` == 'Confirmed'
)
| SheetName | Country | Case def. | Total cases | Total deaths | Country Report Date |
|---|---|---|---|---|---|
| Jan 06, 2016 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 30, 2015 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 23, 2015 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 16 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 9 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 2 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 25 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 18 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 11 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 4 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 28 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 21 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 14 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 7 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Sep 30 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Sep 23 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Sep 16 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Sep 9 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Sep 2 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Aug 26 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Aug 19 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Aug 12 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Aug 5 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| July 29 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| July 22 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| July 15 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| July 8 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| July 1 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| June 24 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| June 17 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| June 10 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| June 3 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| May 27 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| May 20 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| May 13 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| May 06 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Apr 29 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Apr 22 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Apr 15 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Apr 08 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Apr 01 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Mar 25 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Mar 18 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Mar 11 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Mar 4 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Feb 25 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Feb 18 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Feb 11 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Feb 04 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 28 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 21 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 14 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 07 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 06 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 05 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Jan 02 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 31 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 30 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 29 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 26 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 24 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 23 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 22 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 19 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 17 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 15 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 10 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 03 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Dec 01 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 26 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 21 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 19 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 14 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 12 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 7 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Nov 5 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 31 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 29 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 25 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 22 | Nigeria | Confirmed | 19 | 7 | 2014-10-19 |
| Oct 17 | Nigeria | Confirmed | 19 | 7 | 2014-10-14 |
| Oct 15 | Nigeria | Confirmed | 19 | 7 | 2014-10-12 |
| Oct 10 | Nigeria | Confirmed | 19 | 7 | 2014-10-08 |
| Oct 8 | Nigeria | Confirmed | 19 | 7 | 2014-10-05 |
| Oct 3 | Nigeria | Confirmed | 19 | 7 | 2014-10-01 |
| Oct 1 | Nigeria | Confirmed | 19 | 7 | 2014-09-28 |
| Sep 26 | Nigeria | Confirmed | 19 | 7 | 2014-09-23 |
| Sep 24 | Nigeria | Confirmed | 19 | 7 | 2014-09-21 |
| Sep 22 | Nigeria | Confirmed | 19 | 7 | 2014-09-19 |
| Sep 18 | Nigeria | Confirmed | 19 | 7 | 2014-09-14 |
| Sep 16 2014 | Nigeria | Confirmed | 19 | 7 | 2014-09-13 |
| Sep 12 2014 | Nigeria | Confirmed | 19 | 7 | 2014-09-07 |
| Sep 8 2014 | Nigeria | Confirmed | 19 | 7 | 2014-09-06 |
| Sep 5 2014 | Nigeria | Confirmed | 18 | 7 | 2014-09-05 |
| Aug 29 2014 | Nigeria | Confirmed | 15 | 6 | 2014-08-25 |
filter() Example 2to perform an or use the single |
filter( ebola.data
, (Country == 'Nigeria') | (Country == 'Sierra Leone')
, `Case def.` == 'Confirmed'
)
an alternate form would be to use %in%
filter( ebola.data
, Country %in% c('Nigeria', 'Sierra Leone')
, `Case def.` == 'Confirmed'
)
distinct() - normalizingFrom the previous filter example note that report date is repeated week after week.
:::{.keyfunction} To get only distinct observations, use distinct(). :::
distinct() - normalizing`Confirmed Cases for Sierra Leone` <-
filter( ebola.data
, Country == 'Sierra Leone'
, `Case def.` == 'Confirmed'
) %>%
select(last_col(), Country, `Case def.`, starts_with('Total')) %>%
distinct()
| Country Report Date | Country | Case def. | Total cases | Total deaths |
|---|---|---|---|---|
| 2015-11-08 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-11-01 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-10-25 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-10-18 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-10-11 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-10-04 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-09-27 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-09-20 | Sierra Leone | Confirmed | 8704 | 3589 |
| 2015-09-13 | Sierra Leone | Confirmed | 8704 | 3587 |
| 2015-09-06 | Sierra Leone | Confirmed | 8699 | 3587 |
| 2015-08-30 | Sierra Leone | Confirmed | 8698 | 3587 |
| 2015-08-23 | Sierra Leone | Confirmed | 8697 | 3586 |
| 2015-08-16 | Sierra Leone | Confirmed | 8697 | 3586 |
| 2015-08-09 | Sierra Leone | Confirmed | 8697 | 3585 |
| 2015-08-02 | Sierra Leone | Confirmed | 8695 | 3585 |
| 2015-07-26 | Sierra Leone | Confirmed | 8694 | 3585 |
| 2015-07-19 | Sierra Leone | Confirmed | 8692 | 3583 |
| 2015-07-12 | Sierra Leone | Confirmed | 8688 | 3581 |
| 2015-07-05 | Sierra Leone | Confirmed | 8674 | 3574 |
| 2015-06-28 | Sierra Leone | Confirmed | 8665 | 3566 |
| 2015-06-21 | Sierra Leone | Confirmed | 8657 | 3562 |
| 2015-06-14 | Sierra Leone | Confirmed | 8649 | 3553 |
| 2015-06-07 | Sierra Leone | Confirmed | 8635 | 3549 |
| 2015-05-31 | Sierra Leone | Confirmed | 8620 | 3546 |
| 2015-05-24 | Sierra Leone | Confirmed | 8608 | 3542 |
| 2015-05-17 | Sierra Leone | Confirmed | 8605 | 3541 |
| 2015-05-10 | Sierra Leone | Confirmed | 8597 | 3538 |
| 2015-05-03 | Sierra Leone | Confirmed | 8595 | 3537 |
| 2015-04-26 | Sierra Leone | Confirmed | 8586 | 3533 |
| 2015-04-19 | Sierra Leone | Confirmed | 8575 | 3511 |
| 2015-04-12 | Sierra Leone | Confirmed | 8563 | 3491 |
| 2015-04-05 | Sierra Leone | Confirmed | 8554 | 3465 |
| 2015-03-29 | Sierra Leone | Confirmed | 8545 | 3433 |
| 2015-03-22 | Sierra Leone | Confirmed | 8520 | 3381 |
| 2015-03-15 | Sierra Leone | Confirmed | 8487 | 3325 |
| 2015-03-08 | Sierra Leone | Confirmed | 8428 | 3263 |
| 2015-03-01 | Sierra Leone | Confirmed | 8370 | 3180 |
| 2015-02-22 | Sierra Leone | Confirmed | 8289 | 3095 |
| 2015-02-15 | Sierra Leone | Confirmed | 8212 | 3042 |
| 2015-02-08 | Sierra Leone | Confirmed | 8135 | 2975 |
| 2015-02-01 | Sierra Leone | Confirmed | 8059 | 2910 |
| 2015-01-25 | Sierra Leone | Confirmed | 7968 | 2833 |
| 2015-01-18 | Sierra Leone | Confirmed | 7903 | 2779 |
| 2015-01-11 | Sierra Leone | Confirmed | 7786 | 2696 |
| 2015-01-04 | Sierra Leone | Confirmed | 7602 | 2577 |
| 2015-01-03 | Sierra Leone | Confirmed | 7570 | 2549 |
| 2014-12-31 | Sierra Leone | Confirmed | 7476 | 2461 |
| 2014-12-28 | Sierra Leone | Confirmed | 7354 | 2392 |
| 2014-12-27 | Sierra Leone | Confirmed | 7326 | 2366 |
| 2014-12-24 | Sierra Leone | Confirmed | 7160 | 2289 |
| 2014-12-21 | Sierra Leone | Confirmed | 7017 | 2216 |
| 2014-12-20 | Sierra Leone | Confirmed | 6975 | 2190 |
| 2014-12-17 | Sierra Leone | Confirmed | 6856 | 2111 |
| 2014-12-14 | Sierra Leone | Confirmed | 6702 | 1876 |
| 2014-12-13 | Sierra Leone | Confirmed | 6638 | 1824 |
| 2014-12-07 | Sierra Leone | Confirmed | 6375 | 1559 |
| 2014-11-30 | Sierra Leone | Confirmed | 5978 | 1374 |
| 2014-11-28 | Sierra Leone | Confirmed | 5831 | 1321 |
| 2014-11-23 | Sierra Leone | Confirmed | 5441 | 1189 |
| 2014-11-18 | Sierra Leone | Confirmed | 5152 | 1058 |
| 2014-11-16 | Sierra Leone | Confirmed | 5056 | 1041 |
| 2014-11-11 | Sierra Leone | Confirmed | 4683 | 978 |
| 2014-11-09 | Sierra Leone | Confirmed | 4523 | 960 |
| 2014-11-04 | Sierra Leone | Confirmed | 4149 | 921 |
| 2014-11-02 | Sierra Leone | Confirmed | 4057 | 893 |
| 2014-10-29 | Sierra Leone | Confirmed | 3778 | 0 |
| 2014-10-27 | Sierra Leone | Confirmed | 3700 | 0 |
| 2014-10-22 | Sierra Leone | Confirmed | 3389 | 1008 |
| 2014-10-19 | Sierra Leone | Confirmed | 3223 | 986 |
| 2014-10-14 | Sierra Leone | Confirmed | 2977 | 932 |
| 2014-10-12 | Sierra Leone | Confirmed | 2849 | 926 |
| 2014-10-08 | Sierra Leone | Confirmed | 2593 | 753 |
| 2014-10-05 | Sierra Leone | Confirmed | 2455 | 725 |
| 2014-10-01 | Sierra Leone | Confirmed | 2179 | 575 |
| 2014-09-28 | Sierra Leone | Confirmed | 2076 | 574 |
| 2014-09-23 | Sierra Leone | Confirmed | 1816 | 557 |
| 2014-09-21 | Sierra Leone | Confirmed | 1745 | 552 |
| 2014-09-19 | Sierra Leone | Confirmed | 1640 | 545 |
| 2014-09-14 | Sierra Leone | Confirmed | 1513 | 517 |
| 2014-09-13 | Sierra Leone | Confirmed | 1464 | 514 |
| 2014-09-07 | Sierra Leone | Confirmed | 1287 | 478 |
| 2014-09-06 | Sierra Leone | Confirmed | 1234 | 461 |
| 2014-09-05 | Sierra Leone | Confirmed | 1146 | 443 |
| 2014-08-25 | Sierra Leone | Confirmed | 935 | 380 |
To sort data use arrange()
sort() is the base version but again, less robust.
Arrange allows you to give sorting criteria.
arrange() Example`Confirmed Cases for Sierra Leone` %>%
arrange(`Country Report Date`, desc(`Case def.`)) %>%
head()
| Country Report Date | Country | Case def. | Total cases | Total deaths |
|---|---|---|---|---|
| 2014-08-25 | Sierra Leone | Confirmed | 935 | 380 |
| 2014-09-05 | Sierra Leone | Confirmed | 1146 | 443 |
| 2014-09-06 | Sierra Leone | Confirmed | 1234 | 461 |
| 2014-09-07 | Sierra Leone | Confirmed | 1287 | 478 |
| 2014-09-13 | Sierra Leone | Confirmed | 1464 | 514 |
| 2014-09-14 | Sierra Leone | Confirmed | 1513 | 517 |
We would like to make Case def. separate columns there are however 2 possible response variables:
Options?
Case Def. Then merge those together.spread the column.# Make subsets
confirmed <- ebola.data %>% filter(`Case def.` == 'Confirmed') %>%
select(-`Case def.`) %>% distinct() %>%
rename_at(vars(starts_with("total")), ~paste("Confirmed", .))
probable <- ebola.data %>% filter(`Case def.` == 'Probable') %>%
select(-`Case def.`) %>% distinct() %>%
rename_at(vars(starts_with("total")), ~paste("Probable", .))
suspected <- ebola.data %>% filter(`Case def.` == 'Suspected') %>%
select(-`Case def.`) %>% distinct() %>%
rename_at(vars(starts_with("total")), ~paste("Suspected", .))
all.cases <- ebola.data %>% filter(`Case def.` == 'All') %>%
select(-`Case def.`) %>% distinct() %>%
rename_at(vars(starts_with("total")), ~paste("All", .))
# Join together
ebola.option1 <-
confirmed %>%
full_join(probable) %>%
full_join(suspected) %>%
full_join(all.cases)
glimpse(ebola.option1)
## Rows: 838 ## Columns: 11 ## $ SheetName <chr> "Jan 06, 2016", "Jan 0... ## $ Country <chr> "Guinea", "Liberia", "... ## $ `Confirmed Total cases` <int> 3351, 3151, 9, 8704, 1... ## $ `Confirmed Total deaths` <int> 2083, 0, 3, 3589, 0, 0... ## $ `Country Report Date` <dttm> 2015-12-27, 2015-05-0... ## $ `Probable Total cases` <int> 453, 1879, 0, 287, 0, ... ## $ `Probable Total deaths` <int> 453, 0, 0, 208, 0, 0, ... ## $ `Suspected Total cases` <int> 0, 5636, 0, 5131, 0, 0... ## $ `Suspected Total deaths` <int> 0, 0, 0, 158, 0, 0, 0,... ## $ `All Total cases` <int> 3804, 10666, 9, 14122,... ## $ `All Total deaths` <int> 2536, 4806, 3, 3955, 0...
Use the join family of functions to merge data together:
inner_join(a, b) - keep only rows that match both a and b.left_join(a, b) - keep all rows of a and add columns in b to the rows that match. Unmatched rows will contain missing values.right_join(a, b) - same as left but swap a and b.full_join(a, b) - keep all rows of both a and b.semi_join(a, b) - keep all rows of a that match b, but don’t add columns from b.anti_join(a, b) - keep only those rows of a that don’t match b.Operations have these parameters:
by - variables to join on, defaults to common variablessuffix - sufixes to add to distinguish common variables that are not part of byebola.option2 <-
ebola.data %>%
select(SheetName, Country, `Case def.`, `Total cases`, `Country Report Date`) %>%
tidyr::spread('Case def.', 'Total cases')
glimpse(ebola.option2)
## Rows: 838 ## Columns: 7 ## $ SheetName <chr> "Apr 01", "Apr 01", "Apr ... ## $ Country <chr> "Guinea", "Liberia", "Mal... ## $ `Country Report Date` <dttm> 2015-03-29, 2015-03-29, ... ## $ All <int> 3492, 9712, 8, 20, 1, 119... ## $ Confirmed <int> 3068, 3151, 7, 19, 1, 854... ## $ Probable <int> 414, 1879, 1, 1, 0, 287, ... ## $ Suspected <int> 10, 4682, 0, 0, 0, 3142, ...
# spread each
cases <- ebola.data %>% select(-`Total deaths`) %>%
tidyr::spread('Case def.', 'Total cases')
deaths <- ebola.data %>% select(-`Total cases`) %>%
tidyr::spread('Case def.', 'Total deaths')
ebola.option3 <-
full_join( cases, deaths
, c('SheetName', 'Country', 'Country Report Date')
, suffix = c(".cases", ".deaths"))
glimpse(ebola.option3)
## Rows: 838 ## Columns: 11 ## $ SheetName <chr> "Apr 01", "Apr 01", "Apr ... ## $ Country <chr> "Guinea", "Liberia", "Mal... ## $ `Country Report Date` <dttm> 2015-03-29, 2015-03-29, ... ## $ All.cases <int> 3492, 9712, 8, 20, 1, 119... ## $ Confirmed.cases <int> 3068, 3151, 7, 19, 1, 854... ## $ Probable.cases <int> 414, 1879, 1, 1, 0, 287, ... ## $ Suspected.cases <int> 10, 4682, 0, 0, 0, 3142, ... ## $ All.deaths <int> 2314, 4332, 6, 8, 0, 3799... ## $ Confirmed.deaths <int> 1900, 0, 0, 7, 0, 3433, 0... ## $ Probable.deaths <int> 414, 0, 0, 1, 0, 208, 0, ... ## $ Suspected.deaths <int> 0, 0, 0, 0, 0, 158, 0, 0,...
summarise(data, ...)Take the data and summarise it by performing the ... operations to it.
summarize( ebola.option3
, 'Observations' = n()
, 'Number of countries' = n_distinct(Country)
, "# of Reporting dates" = n_distinct(`Country Report Date`)
, max.cases = max(All.cases, na.rm=TRUE)
, max.deaths = max(All.deaths, na.rm=TRUE)
)
| Observations | Number of countries | # of Reporting dates | max.cases | max.deaths |
|---|---|---|---|---|
| 838 | 11 | 120 | 14122 | 4806 |
group_by(data, ...)Take the data and group it by variables specified in ...,
all subsequent operations should be done by group.
ebola.option3 %>% group_by(Country) %>%
summarise( "# of Reporting dates" = n_distinct(`Country Report Date`)
, max.cases = max(All.cases, na.rm=TRUE)
, max.deaths = max(All.deaths, na.rm=TRUE)
)
| Country | # of Reporting dates | max.cases | max.deaths |
|---|---|---|---|
| Guinea | 91 | 3810 | 2536 |
| Italy | 26 | 1 | 0 |
| Liberia | 53 | 10666 | 4806 |
| Liberia2 | 27 | 9 | 3 |
| Mali | 68 | 8 | 6 |
| Nigeria | 16 | 22 | 8 |
| Senegal | 15 | 3 | 0 |
| Sierra Leone | 84 | 14122 | 3955 |
| Spain | 16 | 1 | 0 |
| United Kingdom | 47 | 1 | 0 |
| United States of America | 74 | 4 | 1 |
Remember the wide data problem?
Time to go back
| 1. | Decide on the variable of interest, our value |
| 1.5 | Summarise to reduce the data to one row per country x reporting date |
| 2. | spread out the number of cases by date. |
5:00
What should we do with our data?
This data set on it’s own is not very interesting.
Let’s build something interesting.
The wbstats package provides access to the world bank database.
library(wbstats)
wbsearch('population', extra=TRUE)
wb(indicator ="SP.POP.TOTL", startdate = 2014, enddate=2014)
Run these commands investigate the output and let’s discuss.
To merge the the world bank data to our ebola data we need a common country variable.
library(countrycode)
long.ebola <- mutate( ebola.option3
, iso3c = countrycode(Country, "country.name", "iso3c"))
assert_that(!any(is.na(long.ebola$iso3c)))
## [1] TRUE
SP.URB.TOTL.ZS - Percentage of Population in Urban Areas (in % of Total Population)SP.POP.TOTL.MA.ZS - Population, male (% of total)SP.POP.TOTL - Population, totalEN.POP.DNST - Population density (people per sq km)IN.POV.HCR.EST.TOTL - Poverty HCR Estimates (%) - TotalNY.GDP.PCAP.CD - GDP per capita (current US$)pop.vars <- c( 'SP.URB.TOTL.ZS', 'SP.POP.TOTL.MA.ZS'
, 'SP.POP.TOTL', 'EN.POP.DNST'
, 'IN.POV.HCR.EST.TOTL', 'NY.GDP.PCAP.CD')
pop.data <- wb( country = unique(long.ebola$iso3c)
, indicator = pop.vars
, startdate = 2014, enddate=2014)
What format is it in?
Are there any problems?
Did we get get everything we expected?
meta.pop.data <- select(pop.data, variable=indicatorID, label=indicator) %>% distinct()
our.data <-
pop.data %>%
select(iso3c, value, indicatorID) %>%
spread(indicatorID, value) %>%
right_join(long.ebola)
Create a table 1
15:00